Abstract

Regular meshes are frequently used for modeling physical phenomena on both serial
and parallel computers. One advantage of regular meshes is that efficient discretization
schemes can be implemented in a straightforward manner. However, geometrically
complex objects, such as aircraft, cannot be easily described using a single regular mesh.
Multiple interacting regular meshes are frequently used to describe complex geometries.
Each mesh models a subregion of the physical domain. The meshes, or subdomains,
can be processed in parallel, with periodic updates carried out to move information
between the coupled meshes. In many cases, there is a relatively small number (one
to a few dozen) of subdomains, so that each subdomain may also be partitioned among
several processors.

We outline a composite run-time/compile-time approach for supporting these problems
efficiently on distributed-memory machines. This paper describes these methods
in the context of a multiblock fluid dynamics problem developed at the NASA Langley
Research Center.

1 Introduction


We are developing methods for porting programs with irregularly coupled regular meshes
(ICRMs), commonly known as multiblock applications, to distributed-memory parallel
computers. In order to ensure that our techniques are applicable to real-world problems, we
have begun our research with a specific multiblock problem from the domain of computational
fluid dynamics (CFD). Although our initial focus was multiblock CFD, we aim to produce
methods that are applicable to all parallel codes that meet the following criteria:

•  The  data  is  divided  into  several  interacting  regions  (typically  called  subdomains). 

• There exists a computational phase in which work on each subdomain can be carried
out independently.

•  Data  access  patterns  within  each  subdomain  are  regular. 

•  Communication  between  subdomains  is  limited  to  rectangular  sections  of  data  that 
are  exchanged  between  subdomains. 

In many problems there are at most a few dozen subdomains of varying sizes. We can
assume that we will have to assign at least some of the subdomains to multiple processors;
we must consequently be prepared to deal with multiple levels of parallelism in ICRM codes.
A model of an ICRM application is shown in Figure 1. Typically, ICRM applications have two
levels of parallelism available. Coarse-grained parallelism is available for processing the
subdomains concurrently. Each subdomain is a self-contained computation region that can,
except for boundary conditions, be operated upon independently of the other subdomains.
In addition, the computation for individual subdomains has fine-grained parallelism available.
In order to achieve efficient execution of ICRM applications on distributed-memory
multicomputers, both levels of parallelism must be exploited. Applying coarse-grained parallelism
will help to keep communication overhead to a manageable fraction of the computation time.
However, since the number of subdomains is relatively small, particularly when compared
to the number of processing elements in current distributed-memory multicomputers, the
coarse-grained parallelism between subdomains will not provide sufficient parallel activity
to keep all processors busy. The fine-grained parallelism within each subdomain must be
used to fill this gap.

Figure 1: ICRM Application Model

The methods we are developing to support ICRM applications are semi-automatic and
include both compile-time and runtime support for partitioning and communication. We
have developed, and benchmarked on the Intel iPSC/860, the runtime support needed to
carry out the required patterns of interprocessor data motion. We have also developed a very
rudimentary compiler prototype to embed this runtime support. The compiler produces,
as output, Fortran 77 code that can be compiled and run on a distributed-memory parallel
computer. This compiler prototype was built to experimentally define what will be needed
to effectively support ICRM computations.

Our ultimate goal in this work is to provide language-level support for ICRMs in a
general-purpose parallel language like Fortran D [FHK+90]. We concentrate here on
describing the functionality that must be added to such a language to handle ICRMs, and
implementation techniques that efficiently support that functionality. In the course of our
work, we have defined extensions to Fortran D that are useful for these problems; these are
a means to an end, not the final product. Although we strongly believe that the functions
provided by these new features will be critical for ICRM support, we believe that further
work is needed to define appropriate syntax. We are currently collaborating with Rice to
develop Fortran D extensions which capture the functionality we describe in this paper.

1.1  Problem  Overview 

The application we investigated is a problem from the domain of computational fluid
dynamics. The serial code was developed by V. Vasta at the NASA Langley Research Center,
and solves the thin-layer Navier-Stokes equations for a fluid flow over a three-dimensional
surface with complex geometry. The problem geometry is decomposed into between one and
a few dozen distinct regions, each of which is modeled with a regular, three-dimensional,
rectangular grid. The boundary conditions of each region are enforced by simulating any
of several situations, including viscous and inviscid walls, symmetry planes, extrapolation
conditions, and interaction with an adjacent region. The size of each region (hereafter
subdomain), its boundary conditions and adjacency information are loaded into the program
at run time. For this application, the same program is run on all subdomains. However,
different subroutines will be executed when applying the boundary conditions on different
subdomains. In general, the code used to process each subdomain of an ICRM application
may be different.

The sequence of activity for this program is as follows:

Read subdomain sizes, boundary conditions and simulation parameters.

Repeat (typically a large number of times):

A. Apply boundary conditions to all subdomains.

B. Carry out computations on each subdomain.

The main body of the program consists of an outer sequential loop and two inner parallel
loops. Each of the inner loops iterates over the subdomains of the problem, the first applying
boundary conditions (Step A), which may involve interaction with other subdomains, and
the second advancing the physical simulation one time step in each subdomain (Step B).
Partitioning of the parallel loops is the source of the coarse-grained parallelism for the
application. Furthermore, within each iteration of the loop that implements Step B there
is fine-grained parallelism available in the form of large parallel loops.

1.2  Compiler  Overview 

To investigate the extent to which ICRM applications can automatically be transformed
for execution on a distributed-memory multicomputer, we designed a rudimentary compiler
geared toward applying the specific set of transformations required by ICRMs. The
compiler is built using the Sigma Toolkit [CLS+91], which provides dependency and dataflow
analysis. Sigma also provides a framework for applying transformations to programs and
includes support for common dialects of Fortran, C and C++. As the main focus of the
compiler was ICRM applications, a number of important compiler functions were not
implemented. Rather than duplicate the efforts of ongoing or existing distributed-memory
compiler projects, such as Fortran D [HKT91], Superb [ZBG88] [Ger89], and AL [Tse90],
which have investigated many of the fundamental issues in distributed-memory compiling,
our experimental compiler uses techniques which are complementary to these other
approaches: applying specific transformations to those sections of the program that exhibit
characteristic ICRM behavior.

Table 1: Compiler Transformations

                     Parallelism             Communication

Between Subdomains   Owner computes:         Subarray exchanges:
                     loop bounds replaced    copies of regular array sections
                     with function calls     replaced with procedure calls

Within a Subdomain   Owner computes:         Overlap cells:
                     loop bounds replaced    size of overlap determined at
                     with function calls     compile time; procedure calls
                                             embedded to implement communication

The transformations performed by the compiler can be organized into four general
categories, as shown in Table 1. The basic responsibilities of the compiler for ICRM applications
are to handle the coarse-grained parallelism between subdomains and the fine-grained parallelism
within a subdomain, and to ensure that the required communication takes place for both
levels of parallelism. As our principal objectives were to determine the level of functionality
required to handle ICRM applications, and to establish the potential for a compiler to
automatically transform annotated ICRM programs for distributed-memory environments,
the major focus of the compiler is to embed procedure calls to the enhanced PARTI runtime
library.

Our transformations introduce both fine and coarse-grained parallelism into the program
by enforcing the owner computes rule. Communication within a subdomain is implemented
using overlap cells, utilizing both compile-time and runtime components. Communication
between subdomains is provided through runtime support for exchanging regular array
sections. Procedure calls to perform the data motion are inserted into the program by the
compiler.


1.3  Organization 

The remainder of this paper is organized as follows. In Section 2, the directives we use
in the annotated version of the program are explained. Section 3 outlines the runtime
library. Section 4 describes the parallelization of the computation for individual subdomains.
In Section 5 we describe the techniques we developed for achieving parallel execution of
multiple subdomains.

2  Fortran  Directives 

As part of our investigation into ICRM applications, we have identified the functionality
needed to express data layout and organization on the processors. Integration of this
functionality into the Fortran D language is currently underway. As a preliminary step, we
have defined an experimental syntax for expressing this functionality in Fortran programs,
and used this syntax to test our support for ICRMs. Although we feel that the expressive
content of our directives is necessary for ICRM applications, the directives themselves are
experimental, and unlikely to be adopted for implementation in Fortran D.

2.1  Subdomain  Placement 

The binding of subdomains to processors has important performance implications. Load
balance plays a crucial role in determining computational efficiency. Since the amount
of computation associated with each subdomain is directly proportional to the number
of elements in the subdomain, good load balancing is achieved by binding processors to
subdomains in a ratio proportional to their sizes. In our implementation, this mapping is
under user control and is specified using program directives.
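The proportionality rule can be illustrated with a short sketch. This is Python written for illustration only, not part of the compiler or runtime library described here; the function name `allocate_processors` and the greedy strategy are assumptions, since in the implementation discussed above the mapping is chosen by the user.

```python
def allocate_processors(sizes, nprocs):
    """Assign processor counts to subdomains roughly in proportion to size.

    Hypothetical helper: a greedy heuristic, not the paper's method.
    """
    if nprocs < len(sizes):
        raise ValueError("need at least one processor per subdomain")
    # Start with one processor each so that no subdomain is left unmapped.
    counts = [1] * len(sizes)
    for _ in range(nprocs - len(sizes)):
        # Hand the next processor to the subdomain that currently has the
        # most elements per assigned processor.
        worst = max(range(len(sizes)), key=lambda s: sizes[s] / counts[s])
        counts[worst] += 1
    return counts


# Two subdomains with 1000 and 250 elements sharing 10 processors.
counts = allocate_processors([1000, 250], 10)
print(counts)  # [8, 2]
```

With a 4:1 size ratio the ten processors are split 8:2, so each processor holds 125 elements.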

The principal abstraction for dealing with data placement is the decomposition.
However, unlike Fortran D, where decompositions are bound to the entire processor set, we map
decompositions to subsets of the processors. The mechanism for specifying this arrangement
is a directive called embed. The embed directive binds a decomposition to a rectangular
subregion of another decomposition. Any number of decompositions may be embedded into a
single root decomposition. The root decomposition is mapped onto the entire set of physical
processors. Embedded decompositions are mapped onto subsets of these processors based
on the relative size and location of the subregion in the root decomposition to which they
are bound. This methodology can easily be extended recursively to support an arbitrary
sequence of embeddings, although for most ICRM applications we are aware of, a two-level
decomposition hierarchy appears to be sufficient. The root level establishes a template onto
which each subdomain can be mapped.

For the Navier-Stokes application, we use a one-dimensional decomposition for the root
level, and embed three-dimensional subdomains into it. For example, if two subdomains, one of
size 10 x 10 x 10 and the other 5 x 5 x 10, were to be mapped onto the physical processing
resource, a root-level decomposition of size 1,250 would be used. The first subdomain would
be embedded into locations 1 through 1000 of this decomposition, and the second subdomain
into locations 1001 through 1250. To clarify the distinction between the declaration of a
decomposition and the specification of its size (which may be runtime dependent), we use
two directives, decomposition and shape, to provide the same functionality as Fortran D's
decomposition. This semantic partitioning allows us to conveniently declare an array of
decompositions to hold the set of subdomains. The dimensionality and size of each of the
decompositions in this array is determined dynamically by the shape directive. Although, in
this example, the sizes are constants, in general, for an ICRM application, the subdomain
sizes are not known until runtime.
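The arithmetic of the embedding example can be checked with a small sketch. The Python below is illustrative only; `embed_ranges` is a hypothetical helper, not a directive or runtime call from the paper, and it assumes subdomains are laid end to end in the one-dimensional root decomposition.

```python
from math import prod


def embed_ranges(shapes):
    """Return the root size and a 1-based inclusive (first, last) range
    in a one-dimensional root decomposition for each subdomain shape."""
    ranges = []
    next_free = 1
    for shape in shapes:
        n = prod(shape)  # number of elements in this subdomain
        ranges.append((next_free, next_free + n - 1))
        next_free += n
    return next_free - 1, ranges


root_size, ranges = embed_ranges([(10, 10, 10), (5, 5, 10)])
print(root_size)  # 1250
print(ranges)     # [(1, 1000), (1001, 1250)]
```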


2.2  Distributing  Array  Data 


In our implementation, the arrays that make up each subdomain are distributed using the
Fortran D align directive. However, since the number of subdomains, and their sizes, are
not known until runtime, we allocate space using a single, one-dimensional work array. To
make it possible to allocate space for multiple decompositions (or multiple elements of an
array of decompositions) using a single work array, the align directive was extended to allow
array reshaping. Our implementation of align supports the arbitrary reshaping of a region
of memory into multidimensional, distributed arrays.
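A minimal sketch of the reshaping idea, ignoring distribution across processors, is given below. It is Python for illustration only; `carve` and `flat_index` are hypothetical names, and the column-major index formula is the standard Fortran storage order rather than anything specific to the extended align directive.

```python
def carve(shapes):
    """Return the starting offset of each array inside one flat work
    array, plus the total number of elements needed."""
    offsets, used = [], 0
    for shape in shapes:
        offsets.append(used)
        n = 1
        for extent in shape:
            n *= extent
        used += n
    return offsets, used


def flat_index(offset, shape, i, j, k):
    """Zero-based position in the work array of the 1-based element
    (i, j, k) of a three-dimensional array stored in column-major order."""
    ni, nj, _ = shape
    return offset + (i - 1) + ni * ((j - 1) + nj * (k - 1))


shapes = [(10, 10, 10), (5, 5, 10)]
offsets, total = carve(shapes)
work = [0.0] * total                       # the single work array
work[flat_index(offsets[1], shapes[1], 2, 1, 1)] = 3.5
print(offsets, total)                      # [0, 1000] 1250
```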

3  Run  Time  Support 

The runtime support contains a number of PARTI procedures which carry out the
bookkeeping needed to track the distributed arrays that describe ICRM problems. This runtime
support is a generalized version of the runtime support described in [BSS91]. The major
functions in the runtime library are listed in Table 2.

There are two principal data structures that are created and maintained in the runtime
library. These data structures are distributed array descriptors and communication
schedules. The distributed array descriptor is a data structure that tracks a variety of attributes
associated with each distributed array, including:

• array dimensionality and size,

• the number of overlap cells (see Section 4) in each dimension,

• array distribution in each dimension, and

• the set of processors to which the distributed array is mapped.
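As a rough picture of what such a descriptor holds, the attributes listed above could be grouped as follows. This is a Python sketch for illustration; the class name `DistArrayDescriptor` and its field names are assumptions and do not reflect the actual layout used by the runtime library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DistArrayDescriptor:
    """Hypothetical record of the attributes of one distributed array."""
    shape: Tuple[int, ...]         # global extent in each dimension
    overlap: Tuple[int, ...]       # number of overlap cells per dimension
    distribution: Tuple[str, ...]  # e.g. "BLOCK" or "CYCLIC" per dimension
    processors: Tuple[int, ...]    # processors the array is mapped to

    def __post_init__(self):
        # Every per-dimension attribute must describe the same number
        # of dimensions.
        if not (len(self.shape) == len(self.overlap) == len(self.distribution)):
            raise ValueError("per-dimension attributes must have equal length")


desc = DistArrayDescriptor(
    shape=(40, 40, 40),
    overlap=(1, 1, 1),
    distribution=("BLOCK", "BLOCK", "BLOCK"),
    processors=tuple(range(8)),
)
print(len(desc.shape))  # 3
```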

Communication schedules are data structures that describe how a specific data transfer is
to be performed, including:

• individual send and receive lists on each processor, and

• data access patterns for moving data between arrays and message buffers.

Table 2: Runtime Library

Distribution Declarations
  create-decomposition   instantiates a decomposition
  embed                  maps decompositions to processors
  distribute             establishes distribution pattern for decompositions
  align                  binds arrays to decompositions,
                         creates distributed array descriptor,
                         records overlap region sizes

Communication Primitives
  exchsched              makes schedule for overlap regions
  subarraysched          makes schedule for subarray exchange
  datamove               executes a schedule (communicates data)

The communication schedule, or schedule, allows us to implement data motion as a
two-phase process. Commonly known as Inspector/Executor, this methodology uses a
preprocessing stage to determine the set of low-level communications primitives which must be
used to transfer the data. A second stage then implements the data communication. This
mechanism has been applied to irregular problems in the PARTI system [SCMB90], and to
both regular and irregular problems as part of the maparray construct in Paragon [CCRS91].
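The two-phase idea can be sketched in a few lines. The Python below simulates processors as lists in a single process and is only an illustration under stated assumptions: the names `inspector` and `executor`, the one-dimensional BLOCK layout, and the per-element schedule entries are all hypothetical. A real schedule would aggregate moves into send and receive lists per processor.

```python
def owner(g, block):
    """Processor and local index holding global index g (BLOCK layout)."""
    return g // block, g % block


def inspector(src_start, dst_start, count, block):
    """Preprocessing stage: record each element move as
    (src_proc, src_local, dst_proc, dst_local)."""
    schedule = []
    for n in range(count):
        sp, sl = owner(src_start + n, block)
        dp, dl = owner(dst_start + n, block)
        schedule.append((sp, sl, dp, dl))
    return schedule


def executor(schedule, src_parts, dst_parts):
    """Second stage: carry out the moves recorded in the schedule."""
    for sp, sl, dp, dl in schedule:
        dst_parts[dp][dl] = src_parts[sp][sl]


block = 4
src = [[0, 1, 2, 3], [4, 5, 6, 7]]     # per-processor pieces of the source
dst = [[0] * 4, [0] * 4]               # per-processor pieces of the target
sched = inspector(2, 3, 4, block)      # copy global 2..5 into global 3..6
for step in range(3):                  # built once, reused every iteration
    executor(sched, src, dst)
print(dst)  # [[0, 0, 0, 2], [3, 4, 5, 0]]
```

Building the schedule once and executing it repeatedly is what allows the preprocessing cost to be amortized.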

Table 2 is organized into two components. The upper section of the table shows the
primitives used to define the distribution of array data. The lower section lists the primitives
used to perform data communication. These primitives can be used directly to program
ICRM applications, or can be embedded into the program automatically.

The procedure create-decomposition defines a new decomposition with a given
dimensionality and size in each dimension. The procedure embed implements the embed
directive (see Section 2.1) and constrains the set of processors associated with a
decomposition. A decomposition that has not been embedded into another decomposition is, by
default, mapped to all processors.

The distribute procedure defines the type of distribution for each dimension of a
decomposition (e.g., BLOCK, CYCLIC or IRREGULAR) and is used to implement the
Fortran D distribute directive.

The align procedure implements the align directive and is used to associate arrays with
decompositions and to create distributed array descriptors. The compiler determines the
number of overlap cells for each array dimension and passes this information to align.
Align writes the distributed array descriptor into a hash table, organized by array starting
address. Using the hash table allows arrays to be passed as parameters between subroutines,
transparently inheriting the distribution information from the calling procedure.
Alternatively, the distribution data can, in some cases, be traced interprocedurally at compile time.
Hiranandani et al. define a process known as reaching decompositions which can be used to
analyze array distribution both intra- and interprocedurally [HKT91].

The communication primitives include a procedure exchsched which computes a
schedule that is used to direct the filling of overlap cells along a given dimension of a distributed
array. The schedule specifies required intra-processor data copying along with a set of send
and receive calls.

The primitive subarraysched carries out the preprocessing required to copy the
contents of a regular section, source, in one subdomain into a regular section, destination,
in another subdomain. The interactions between subdomains for ICRM applications are
limited to the exchange of regular subsections, as illustrated in Figure 2. The
subarraysched primitive supports data moves between arbitrary rectangular sections of two
subdomains, and can transpose the data along any dimension. Subarraysched can also
copy the contents of a regular section in a given subdomain into another regular section
in the same subdomain. Subarraysched produces a schedule which specifies a pattern
of intra-processor data transfers along with a set of send and receive calls. The primitive
subarraysched executes on each processor. On a given processor P, subarraysched must
find out whether it owns any portion of source. If P does own some portion, source_P, of
source, subarraysched must calculate the processors to which various subsets of source_P
will have to be sent. Subarraysched must also calculate whether processor P owns any
portion of destination and, if so, it must prepare to receive the appropriate messages.

[Figure 2: Move a section of a face from one subdomain to a section of a face on a
different subdomain]
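The ownership calculation can be sketched in one dimension. The Python below is an illustration only, not the subarraysched algorithm itself: it assumes BLOCK distributions, half-open global index ranges, and separate processor numberings for the source and destination subdomains, and all function names are hypothetical.

```python
def owned_range(p, block):
    """Half-open global range [lo, hi) held by processor p (BLOCK layout)."""
    return p * block, (p + 1) * block


def intersect(a, b):
    """Intersection of two half-open ranges, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def sends_for(p, source, dest_start, src_block, dst_block, dst_nprocs):
    """Messages source-side processor p must send, as a list of
    (dest_proc, src_lo, src_hi) with half-open global source ranges."""
    mine = intersect(owned_range(p, src_block), source)
    if mine is None:
        return []                        # p owns no part of the source
    shift = dest_start - source[0]       # source index -> destination index
    sends = []
    for q in range(dst_nprocs):
        target = intersect(owned_range(q, dst_block),
                           (mine[0] + shift, mine[1] + shift))
        if target is not None:
            sends.append((q, target[0] - shift, target[1] - shift))
    return sends


# Source section: global indices 2..9 on processors holding 4 elements each.
# Destination section starts at index 0 on processors holding 3 elements each.
for p in range(4):
    print(p, sends_for(p, (2, 10), 0, 4, 3, 3))
```

The receive side follows by the same intersection with the roles of source and destination exchanged.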

Schedules produced by exchsched and subarraysched are employed by a primitive
called datamove that carries out communication and intra-processor data copying.

4  Computation  Within  a  Subdomain 

The computation within a subdomain requires predominantly near-neighbor
communication. A typical loop nest for this component of an ICRM application is shown in Figure 3.

      do k = 1, kmx
        do j = 1, jmx
          do i = 1, imx
            qs = (skx(i,j,k) + sky(i,j,k) + skz(i,j,k))/ra
            ra = 0.5*(w(i,j,k+1) + w(i,j,k))
            hs(i,j,k) = qs*ra
            ra = 0.5*(w(i,j+1,k) + w(i,j+1,k))
            gs(i,j,k) = qs*ra
            ra = 0.5*(w(i+1,j,k) + w(i,j,k))
            fs(i,j,k) = qs*ra
          enddo
        enddo
      enddo

Figure 3: Example Code for Sweep Over a Single Subdomain

The loop nest is computationally intensive, with no loop-carried data dependencies.
Because the communication is regular, this code can be efficiently handled by the overlap cell
method described by Gerndt in [Ger90]. Our compiler transforms this code as follows:

• Overlap cells are determined by scanning every subroutine in the program and
accumulating the data in an interprocedural analysis phase of the compiler. When two
subroutines have different overlap cell requirements, the maximum of the two values
is used. The final value for the number of overlap cells for each dimension of every
array must be a constant.

• Local array sizes are determined dynamically at subroutine boundaries. The array
sizes are computed by a function in the runtime library, and include the extra
memory required for the overlap cells.

• Loops are partitioned to enforce the owner computes rule within the loop body. For
this transformation, the compiler identifies an array appearing on the left-hand side
of an assignment statement for which the loop index is used as a subscript. This loop
is then partitioned in the same manner as the array. For multidimensional arrays, the
compiler identifies a specific dimension of an array for each loop in the loop nest.
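The loop partitioning step can be sketched as a function that replaces fixed loop bounds with processor-local ones. This is a Python illustration only; `local_bounds` is a hypothetical name, and a BLOCK distribution with ceiling-sized blocks is assumed rather than taken from the runtime library.

```python
def local_bounds(n, nprocs, p):
    """1-based inclusive loop bounds owned by processor p for a
    BLOCK-distributed dimension of extent n.

    If the returned lower bound exceeds the upper bound, processor p
    owns no iterations.
    """
    block = -(-n // nprocs)          # ceiling division
    lo = p * block + 1
    hi = min(n, (p + 1) * block)
    return lo, hi


# A loop "do k = 1, 40" partitioned over 4 processors.
for p in range(4):
    lo, hi = local_bounds(40, 4, p)
    print(p, lo, hi)                 # each processor gets 10 iterations
```

Each processor then executes only the iterations that assign to elements it owns, which is the owner computes rule applied to that loop.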

4.1  Performance 

Figure 4 shows the performance obtained while processing a single 64,000-element
subdomain. Figure 5 shows the same data normalized by the number of processors. The data was
collected using an iPSC/860 multicomputer processing a single 40 x 40 x 40 subdomain. The
timings were made from a single routine which is representative of the computational
behavior of the program while processing an individual subdomain. The curve labeled "Actual"
shows the performance, in megaflops, obtained for a single invocation of the subroutine.
The "Optimistic" curve shows the performance that results when the time spent
computing the communication schedules is excluded. Since schedules can be reused, this cost can
be amortized over several invocations of the subroutine. The optimistic curve reflects the
asymptotic performance for several iterations of this routine.

The "Ideal" curve includes only the message-passing time, and excludes the time
required to create communication schedules, and the time spent reorganizing the data. For
a multidimensional array, the elements that must be transferred to fill the overlap cells
will not, in general, be in a contiguous section of memory. To transfer this data between
processors on the iPSC/860, it must first be copied into a local buffer. After
transmission, the data is again reorganized as it is placed into the overlap cells. The "Ideal" curve
excludes the time spent packing and unpacking data. Since communication is always
required for a distributed-memory implementation, this curve demonstrates the maximum
possible performance for this loop given the bandwidth and communication latencies of the
iPSC/860.


Figure 4: iPSC/860 Single-Subdomain Performance (curves: Communication-free, Ideal,
Optimistic, Actual; horizontal axis: Number of Processors)


The curve labeled "Communication-free" shows the computation rate obtained when no
communication takes place. As the partition size on each processor decreases, the
computation rate on each node also decreases. This effect is attributable to the increased relative
cost of loop overhead and pipeline setup time. This curve demonstrates that even when
communication effects are excluded, a large grain size will result in better overall
performance. This data indicates the upper bound on the performance imposed by the application
program code and the if77 compiler.

5  Supporting  Multiple  Subdomains 

An important characteristic of ICRM applications is the relative independence of the
subdomains. Much of the computation for a subdomain can be performed in parallel with
the processing of other subdomains. As Figure 5 illustrates, there is a potential for much
higher overall performance by partitioning the set of processors, and binding a relatively
small number of processors to each subdomain. The cost of this approach is periodic
synchronization between those subdomains which must exchange data.

Figure 5: Per Node Single Subdomain Performance

5.1  Inter-Subdomain  Communication 

Although the processing of individual subdomains exhibits regular communication,
interaction between subdomains is irregular. An illustration of the sort of communication that
is required is shown in Figure 6. The figure shows two subdomains, one which models the
airflow around a wing and another which models the region around a control surface on the
wing. In this problem, there are two boundary conditions which require inter-subdomain
communication. These boundary conditions consist of segments along exterior edges of the
grids that, in the problem geometry, are adjacent. Although the sections are rectangular,
the beginning and ending points of the sections are not determined until runtime. In
general, the adjacency information for an ICRM is problem specific, and not determined until
runtime. However, an efficient implementation should be able to take advantage of the fact
that communication is limited to the exchange of rectangular sections of data.

Transforming an ICRM application to efficiently handle this type of data communication
begins by identifying those locations in the program which require data transfer between
subdomains. Our implementation recognizes code that performs regular data moves between
arrays by simple symbolic analysis of array subscripts and checking the dataflow pattern
in loop bodies. When the compiler detects that a regular section of an array is being
transferred into another array, it removes the assignment statements from the loop body
and inserts procedure calls to implement the data motion. Since the runtime library can
move regular sections of data between subdomains, or within the same subdomain, this
technique is safe for any parallel loop (i.e., a loop with no loop-carried dependencies).

Figure 6: Data Movement Between Subdomains


Table 3: Inter-Subdomain Communication

Processors   Transfer   Scheduling   Total
     4       11 ms      32 ms        43 ms
     8       6.3 ms     32 ms        39 ms
    16       4.5 ms     28 ms        33 ms
    32       2.5 ms     25 ms        28 ms

5.2  Performance 

We applied our methodology for inter-subdomain data transfers to a simplified version of the
Navier-Stokes application and benchmarked the resulting code on the iPSC/860. Table 3
shows the time required to communicate data between two 40 x 40 x 40 subdomains for
different numbers of processors. Each subdomain is placed on half of the processors, and one 40 x 40
face is transferred from subdomain 1 to subdomain 2. The column labeled "Transfer" gives
the time required to actually transfer the data. The "Scheduling" column shows the time
required to compute the communication schedules. Since communication schedules can be
reused several times, the asymptotic performance is the data transfer rate.
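The effect of schedule reuse on the figures in Table 3 can be worked through with a short calculation. The Python below is illustrative only: `cost_per_transfer` is a hypothetical helper, it assumes the scheduling cost is paid once and spread evenly over all reuses, and the 4-processor transfer time is read as 11 ms, which is consistent with the 43 ms total.

```python
# Processors: (transfer time in ms, scheduling time in ms), from Table 3.
TABLE3 = {4: (11.0, 32.0), 8: (6.3, 32.0), 16: (4.5, 28.0), 32: (2.5, 25.0)}


def cost_per_transfer(nprocs, reuses):
    """Average cost in ms of one transfer when a schedule is built once
    and then used for the given number of transfers."""
    transfer, scheduling = TABLE3[nprocs]
    return transfer + scheduling / reuses


print(cost_per_transfer(4, 1))    # 43.0, the "Total" column
print(cost_per_transfer(4, 32))   # 12.0, close to the transfer time alone
```

As the number of reuses grows, the average cost approaches the "Transfer" column.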

As a point of reference, the time required to complete one iteration of the subroutine
benchmarked in Section 4.1 was on the order of a few hundred milliseconds, ranging from
230ms on 32 processors to 760ms on 4 processors. The inter-subdomain communication
time is an order of magnitude less than the computation time for each iteration and is not
an impediment to processing multiple subdomains in parallel.

5.3  Partitioning  for  Coarse-Grain  Parallelism 

The parallelism between subdomains can be obtained by introducing partitioning at several
levels in the program. Our implementation is based on introducing partitioning at a
relatively low level. With the exception of standard library routines, such as sqrt,
subroutine calls are run on every processor. Within a subroutine, loops are partitioned to
mirror the array partition, as described in Section 4. This loop partitioning results in
parallel execution of the loop on all processors bound to the subdomain. In addition,
concurrent processing of multiple subdomains also develops, as processors bypass the loops
which iterate over non-local subdomains. This method is quite effective at extracting the
coarse-grained parallelism available in the Navier-Stokes application because of the large
amount of computation within each subdomain. Since a typical loop over an individual
subdomain requires several tens or perhaps hundreds of milliseconds on the iPSC/860, the
overhead associated with subroutine calls and runtime tests to check locality is insignificant.

Our straightforward approach is safe, and will result in correct execution on distributed-
memory multicomputers, but it is not especially aggressive. More sophisticated techniques
may be required to extract the coarse-grained parallelism in programs with many, relatively
small subdomains. The loop in Figure 7 shows a simplified version of one of the main loops
for the Navier-Stokes application. This loop iterates over a set of meshes (subdomains) in
sequence. For each mesh, three subroutines are called. The parameters to the subroutines
are a section of the array X, and the sizes for each dimension of the current mesh. As
described in Section 2, a single, one-dimensional array is used to hold the data
for all subdomains. In our implementation, every processor executes this loop serially,
and executes every subroutine call. For any given subdomain, m, some processors will
participate in the computation of the subroutines, and others will simply fall through the
loops, performing no iterations because they store none of the elements of m.
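
The fall-through behavior described above can be simulated in a few lines of Python. This is a sketch under assumptions, not the paper's implementation: the block distribution in `owned_range`, the function `run_on_processor`, and the layout of the `subdomains` dictionary are all hypothetical, and each subdomain is treated as a one-dimensional index range for brevity.

```python
def owned_range(total, nprocs, rank):
    """Block distribution: half-open index range owned by `rank`."""
    base, extra = divmod(total, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def run_on_processor(rank, subdomains):
    """Visit every subdomain; do work only where elements are stored locally.

    `subdomains` maps a subdomain id to (size, ranks bound to it).
    Returns a list of (subdomain id, iterations performed).
    """
    work_log = []
    for m, (size, ranks) in sorted(subdomains.items()):
        if rank in ranks:
            lo, hi = owned_range(size, len(ranks), ranks.index(rank))
        else:
            lo, hi = 0, 0  # no local elements: the loop below falls through
        iterations = 0
        for _ in range(lo, hi):
            iterations += 1  # stand-in for the numerical kernel
        work_log.append((m, iterations))
    return work_log


# Two subdomains of 40 elements, each bound to half of four processors.
subdomains = {1: (40, [0, 1]), 2: (40, [2, 3])}
logs = {rank: run_on_processor(rank, subdomains) for rank in range(4)}
for rank, log in logs.items():
    print(rank, log)
```

Every processor still enters the loop for every subdomain, which is the overhead that becomes noticeable when there are many small subdomains.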

More efficient parallelization is possible if the unnecessary subroutine calls can be
avoided on individual processors. A compile-time methodology for this can be based on
interprocedural regular section analysis. By performing regular section analysis, such as
that described in [HK90], [HK91], it may be possible to determine that a regular region of
the array X is accessed within the individual subroutines. Further symbolic analysis can
      do m = 1, num_domains
         call navier(X(mesh(m)), isize(m), jsize(m), ksize(m))
         call flux(X(mesh(m)), isize(m), jsize(m), ksize(m))
         call solve(X(mesh(m)), isize(m), jsize(m), ksize(m))
      enddo

      subroutine navier(X, isz, jsz, ksz)
      dimension X(isz, jsz, ksz)

      end

Figure 7: Loop Over Subdomains

then be applied to associate this section of X with the subdomain to which it has been
aligned. Note that this test may require interprocedural analysis since the distribution of
X may have occurred in a different subroutine from the loop shown in Figure 7. Once the
compiler has determined that each subroutine invocation accesses only a single subdomain,
it may partition the loop according to which subdomains are local.
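
The core test, deciding whether an accessed section of the one-dimensional array falls entirely within one subdomain, can be sketched as follows. This is an illustrative run-time analogue in Python of a check the compiler would perform symbolically; the functions `build_offsets` and `subdomain_of_section`, the 0-based offsets, and the example sizes are assumptions, and every subdomain is assumed to be non-empty.

```python
from bisect import bisect_right


def build_offsets(sizes):
    """Start offset of each subdomain in the shared 1-D array (0-based)."""
    offsets, start = [], 0
    for isz, jsz, ksz in sizes:
        offsets.append(start)
        start += isz * jsz * ksz
    return offsets, start


def subdomain_of_section(offsets, total, lo, hi):
    """Index of the single subdomain containing [lo, hi], or None."""
    if not (0 <= lo <= hi < total):
        return None
    m = bisect_right(offsets, lo) - 1          # subdomain holding element lo
    end = offsets[m + 1] if m + 1 < len(offsets) else total
    return m if hi < end else None             # hi must not cross into m + 1


sizes = [(40, 40, 40), (20, 30, 10), (8, 8, 8)]
offsets, total = build_offsets(sizes)
print(subdomain_of_section(offsets, total, 64000, 69999))  # inside one subdomain
print(subdomain_of_section(offsets, total, 63999, 64000))  # straddles a boundary
```

Only when every call in the loop body maps to a single subdomain in this sense can the iteration for that subdomain be skipped on processors that hold none of its data.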

An alternative approach to partitioning this loop could be based on runtime preprocessing.
Since the loop shown in Figure 7 is executed many times as part of an outer sequential
loop (see Section 1.1), the cost of determining the loop partitioning at runtime will likely be
an insignificant part of the total execution time. However, even with a runtime approach,
some analysis must be performed to ensure that the partitioning remains valid as this loop
is reexecuted at each iteration of the outer loop.
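
A runtime-preprocessing scheme of this kind might look like the following Python sketch. It is a sketch under assumptions rather than the authors' design: the class `LoopPartitioner` and the `version` counter used to detect a changed subdomain-to-processor mapping are hypothetical, and the validity check is deliberately simplistic.

```python
class LoopPartitioner:
    """Caches the list of local subdomains for one processor.

    The list is rebuilt only when the caller reports a new mapping
    version, so the preprocessing cost is paid once per remapping.
    """

    def __init__(self, rank):
        self.rank = rank
        self._cached_version = None
        self._local = []
        self.rebuilds = 0

    def local_subdomains(self, mapping, version):
        # `version` is a hypothetical counter that the program bumps
        # whenever subdomains are redistributed among processors.
        if version != self._cached_version:
            self._local = [m for m, ranks in sorted(mapping.items())
                           if self.rank in ranks]
            self._cached_version = version
            self.rebuilds += 1
        return self._local


# Subdomain id -> set of processors bound to it.
mapping = {1: {0, 1}, 2: {2, 3}, 3: {0, 1, 2, 3}}
partitioner = LoopPartitioner(rank=2)
visited = []
for step in range(5):                       # outer sequential loop
    for m in partitioner.local_subdomains(mapping, version=0):
        visited.append((step, m))           # stand-in for the three calls
print(partitioner.rebuilds, len(visited))
```

The preprocessing runs once for the five outer iterations; the remaining difficulty, as noted above, is establishing that nothing inside the loop invalidates the cached partitioning.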

Both of these approaches require interprocedural and symbolic analysis that may stretch
the limits of what seems reasonable to expect from the current generation of compilers. User
input, in the form of additional directives, may be required in order to partition loops such as
the one shown in Figure 7. Furthermore, since a high level of coarse-grained parallelism can
still be obtained on large, numerically intensive programs even when this loop is executed
sequentially on all processors, the most effective method for extracting the inter-subdomain
parallelism from ICRM applications is an open question.


6  Conclusions 

We have developed methods for efficiently executing ICRM applications on distributed-
memory multicomputers. These applications are an important class of scientific programs
with computational behavior requiring specialized support not available in the current set
of distributed-memory compilers. The fundamental aspects of this class of applications that
we have addressed include:

•  Identifying the set of functionality to be introduced at the language level for
programming ICRMs.

•  Developing a methodology for maintaining several interacting subdomains, each
distributed on a subset of the total processors.

•  Providing  communication  support  both  within  a  subdomain  and  between  subdomains. 

•  Identifying the compile-time requirements for embedding the communication support,
and for extracting parallelism (both fine-grained and coarse-grained) from ICRM
applications.

The efficacy of our approach has been verified using a rudimentary compiler which
implements a set of highly specialized program transformations, and embeds procedure
calls to implement data motion. A runtime library has been developed for the iPSC/860
multicomputer. This library implements the core set of functionality required by ICRMs,
and has been tested using an applications program developed at the NASA Langley Research
Center. The communications overhead imposed by the runtime support has been shown
not to be prohibitive to achieving good performance on the iPSC/860.